Abstract

We apply kernel-based methods to solve the difficult reinforcement learning problem of
3vs2 keepaway in RoboCup simulated soccer. Key challenges in keepaway are the high
dimensionality of the state space (rendering conventional discretization-based function
approximation like tilecoding infeasible), the stochasticity due to noise and multiple learning
agents needing to cooperate (meaning that the exact dynamics of the environment are
unknown), and real-time learning (meaning that an efficient online implementation is required).
We employ the general framework of approximate policy iteration with least-squares-based
policy evaluation. As underlying function approximator we consider the family of
regularization networks with subset of regressors approximation. The core of our proposed solution
is an efficient recursive implementation with automatic supervised selection of relevant
basis functions. Simulation results indicate that the behavior learned through our approach
clearly outperforms the best results obtained with tilecoding by Stone et al. (2005).

Keywords: Reinforcement Learning, Least-squares Policy Iteration, Regularization Networks,
RoboCup
1. Introduction

RoboCup simulated soccer has been conceived and is widely accepted as a common plat- 
form to address various challenges in artificial intelligence and robotics research. Here, we 
consider a subtask of the full problem, namely the keepaway problem. In keepaway we have 
two smaller teams: one team (the 'keepers') must try to maintain possession of the ball 
for as long as possible while staying within a small region of the full soccer field. The
other team (the 'takers') tries to gain possession of the ball. Stone et al. (2005) initially
formulated keepaway as benchmark problem for reinforcement learning (RL); the keepers 
must individually learn how to maximize the time they control the ball as a team against 
the team of opposing takers playing a fixed strategy. The central challenges to overcome 
are, for one, the high dimensionality of the state space (each observed state is a vector 
of 13 measurements), meaning that conventional approaches to function approximation in 
RL, like grid-based tilecoding, are infeasible; second, the stochasticity due to noise and 
the uncertainty in control due to the multi-agent nature imply that the dynamics of the 
environment are both unknown and cannot be obtained easily. Hence we need model-free
methods. Finally, the underlying soccer server expects an action every 100 msec, meaning 
that efficient methods are necessary that are able to learn in real-time. 

Stone et al. (2005) successfully applied RL to keepaway, using the textbook approach
with online Sarsa($\lambda$) and tilecoding as underlying function approximator (Sutton and Barto,
1998). However, tilecoding is a local method and places parameters (i.e. basis functions) in a
regular fashion throughout the entire state space, such that the number of parameters grows
exponentially with the dimensionality of the space. In (Stone et al., 2005) this very serious
shortcoming was addressed by exploiting problem-specific knowledge of how the various state
variables interact. In particular, each state variable was considered independently from the
rest. Here, we will demonstrate that one can also learn using the full (untampered) state
information, without resorting to simplifying assumptions.

In this paper we propose a (non-parametric) kernel-based approach to approximate the 
value function. The rationale for doing this is that by representing the solution through the 
data and not by some basis functions chosen before the data becomes available, we can better 
adapt to the complexity of the unknown function we are trying to estimate. In particular, 
parameters are not 'wasted' on parts of the input space that are never visited. The hope is 
that thereby the exponential growth of parameters is bypassed. To solve the RL problem of 
optimal control we consider the framework of approximate policy iteration with the related
least-squares based policy evaluation methods LSPE($\lambda$) proposed by Nedić and Bertsekas
(2003) and LSTD($\lambda$) proposed by Boyan (1999). Least-squares based policy evaluation
is ideally suited for the use with linear models and is a very sample-efficient variant of
RL. In this paper we provide a unified and concise formulation of LSPE and LSTD; the
approximated value function is obtained from a regularization network which is effectively
the mean of the posterior obtained by GP regression (Rasmussen and Williams, 2006). We
use the subset of regressors method (Smola and Schölkopf, 2000; Luo and Wahba, 1997) to
approximate the kernel using a much reduced subset of basis functions. To select this subset
we employ greedy online selection, similar to (Csató and Opper, 2001; Engel et al., 2003),
that adds a candidate basis function based on its distance to the span of the previously
chosen ones. One improvement is that we consider a supervised criterion for the selection 
of the relevant basis functions that takes into account the reduction of the cost in the 
original learning task in addition to reducing the error incurred from approximating the 
kernel. Since the per-step complexity during training and prediction depends on the size 
of the subset, making sure that no unnecessary basis functions are selected ensures more 
efficient usage of otherwise scarce resources. In this way learning in real-time (a necessity 
for keepaway) becomes possible. 

This paper is structured in three parts: the first part (Section 2) gives a brief
introduction to reinforcement learning and to carrying out general regression with regularization
networks. The second part (Section 3) describes and derives an efficient recursive
implementation of the proposed approach, particularly suited for online learning. The third part
describes the RoboCup-keepaway problem in more detail (Section 4) and contains the
results we were able to achieve (Section 5). A longer discussion of related work is deferred to
the end of the paper; there we compare the similarities of our work with that of Engel et al.

2. Background

In this section we briefly review the subjects of RL and regularization networks. 



2.1 Reinforcement Learning 

Reinforcement learning (RL) is a simulation-based form of approximate dynamic programming,
e.g. see (Bertsekas and Tsitsiklis, 1996). Consider a discrete-time dynamical system
with states $S = \{1, \ldots, N\}$ (for ease of exposition we assume the finite case). At each time
step $t$, when the system is in state $s_t$, a decision maker chooses a control-action $a_t$ (again,
selected from a finite set $A$ of admissible actions) which changes probabilistically the state
of the system to $s_{t+1}$, with distribution $P(s_{t+1} \mid s_t, a_t)$. Every such transition yields an
immediate reward $r_{t+1} = R(s_{t+1} \mid s_t, a_t)$. The ultimate goal of the decision-maker is to choose
a course of actions such that the long-term performance, a measure of the cumulated sum
of rewards, is maximized.



2.1.1 Model-free Q-value function and optimal control

Let $\pi$ denote a decision-rule (called the policy) that maps states to actions. For a fixed
policy $\pi$ we want to evaluate the state-action value function (Q-function), which for every
state-action pair $(s,a)$ is taken to be the expected infinite-horizon discounted sum of rewards
obtained from starting in state $s$, choosing action $a$ and then proceeding to select actions
according to $\pi$:

$$ Q^\pi(s,a) := E^\pi \Big\{ \sum_{t \ge 0} \gamma^t r_{t+1} \,\Big|\, s_0 = s, a_0 = a \Big\} \quad \forall s,a \tag{1} $$

where $s_{t+1} \sim P(\cdot \mid s_t, \pi(s_t))$ and $r_{t+1} = R(s_{t+1} \mid s_t, \pi(s_t))$. The parameter $\gamma \in (0,1)$ denotes
a discount factor.

Ultimately, we are not directly interested in $Q^\pi$; our true goal is optimal control, i.e. we
seek an optimal policy $\pi^* = \operatorname{argmax}_\pi Q^\pi$. To accomplish that, policy iteration interleaves
the two steps policy evaluation and policy improvement: First, compute $Q^{\pi_k}$ for a fixed
policy $\pi_k$. Then, once $Q^{\pi_k}$ is known, derive an improved policy $\pi_{k+1}$ by choosing the
greedy policy with respect to $Q^{\pi_k}$, i.e. by choosing in every state the action $\pi_{k+1}(s) =
\operatorname{argmax}_a Q^{\pi_k}(s, a)$ that achieves the best Q-value. Obtaining the best action is trivial if we
employ the Q-notation; otherwise we would need the transition probabilities and reward
function (i.e. a 'model').

To compute the Q-function, one exploits the fact that $Q^\pi$ obeys the fixed-point relation
$Q^\pi = T_\pi Q^\pi$, where $T_\pi$ is the Bellman operator

$$ (T_\pi Q)(s,a) := E_{s' \sim P(\cdot \mid s,a)} \big\{ R(s' \mid s,a) + \gamma Q(s', \pi(s')) \big\}. $$

In principle, it is possible to calculate $Q^\pi$ exactly by solving the corresponding linear system
of equations, provided that the transition probabilities $P(s' \mid s,a)$ and rewards $R(s' \mid s,a)$ are
known in advance and the number of states is finite and small.
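
For a small finite problem with a known model, the exact computation just described can be sketched as follows. This is only an illustration and not part of the method proposed in this paper: the array layout `P[s, a, s2]`, `R[s, a, s2]` and the function names `evaluate_policy` and `policy_iteration` are assumptions made for the example.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma):
    """Solve Q = T_pi Q exactly for a finite MDP with a known model.

    P[s, a, s2] holds transition probabilities and R[s, a, s2] rewards
    (a hypothetical layout chosen for this sketch).
    """
    n_states, n_actions, _ = P.shape
    # Expected immediate reward for every state-action pair (s, a).
    r_bar = np.einsum("sap,sap->sa", P, R).reshape(-1)
    # Transition matrix between state-action pairs under the policy:
    # row (s, a), column (s2, policy[s2]) carries P(s2 | s, a).
    P_pi = np.zeros((n_states * n_actions, n_states * n_actions))
    for s2 in range(n_states):
        P_pi[:, s2 * n_actions + policy[s2]] = P[:, :, s2].reshape(-1)
    q = np.linalg.solve(np.eye(n_states * n_actions) - gamma * P_pi, r_bar)
    return q.reshape(n_states, n_actions)

def policy_iteration(P, R, gamma, max_iter=100):
    """Alternate exact policy evaluation and greedy policy improvement."""
    policy = np.zeros(P.shape[0], dtype=int)
    Q = evaluate_policy(P, R, policy, gamma)
    for _ in range(max_iter):
        improved = Q.argmax(axis=1)      # greedy policy w.r.t. Q
        if np.array_equal(improved, policy):
            break
        policy = improved
        Q = evaluate_policy(P, R, policy, gamma)
    return policy, Q
```

The linear system has one unknown per state-action pair, which is why this route is only open when the number of states is small and the model is available.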

However, in many practical situations this is not the case. If the number of states is
very large or infinite, one can only operate with an approximation of the Q-function, e.g. a
linear approximation $Q(s, a; \mathbf{w}) = \boldsymbol{\phi}_m(s, a)^T \mathbf{w}$, where $\boldsymbol{\phi}_m(s, a)$ is an $m$-dimensional feature
vector and $\mathbf{w}$ the adjustable weight vector. To approximate the unknown expectation value
one employs simulation (i.e. an agent interacts with the environment) to generate a large
number of observed transitions. Figure 1 depicts the resulting approximate policy iteration
framework: using only a parameterized Q and sample transitions to emulate application
of the Bellman operator $T_{\pi_k}$ means that we can carry out the policy evaluation step only
approximately. Also, using an approximation of $Q^{\pi_k}$ to derive an improved policy does not
necessarily mean that the new policy actually is an improved one; oscillations in policy space
are possible. In practice however, approximate policy iteration is a fairly sound procedure that
either converges or oscillates with bounded suboptimality (Bertsekas and Tsitsiklis, 1996).

Figure 1: Approximate policy iteration framework.



Inferring a parameter vector $\mathbf{w}_k$ from sample transitions such that $Q(\cdot\,; \mathbf{w}_k)$ is a good
approximation to $Q^{\pi_k}$ is therefore the central problem addressed by reinforcement learning.
Chiefly two questions need to be answered:

1. By what method do we choose the parametrisation of Q and carry out regression? 

2. By what method do we learn the weight vector w of this approximation, given sample 
transitions? 

The latter can be solved by the family of temporal difference learning, with TD($\lambda$),
initially proposed by Sutton (1988), being its most prominent member. Using a linearly
parametrized value function, it was shown in (Tsitsiklis and Roy, 1997) that TD($\lambda$)
converges to the true value function (under certain technical assumptions).

2.1.2 Approximate policy evaluation with least-squares methods 

In what follows we will discuss three related algorithms for approximate policy evaluation
that share most of the advantages of TD($\lambda$) but converge much faster, since they
are based on solving a least-squares problem in closed form, whereas TD($\lambda$) is based on
stochastic gradient descent. All three methods assume that an (infinitely) long$^1$ trajectory
of states and rewards is generated using a simulation of the system (e.g. an agent
interacting with its environment). The trajectory starts from an initial state $s_0$ and consists
of tuples $(s_0, a_0), (s_1, a_1), \ldots$ and rewards $r_1, r_2, \ldots$ where action $a_t$ is chosen according to
$\pi$ and successor states and associated rewards are sampled from the underlying transition
probabilities. From now on, to abbreviate these state-action tuples, we will understand $\mathbf{x}_t$
as denoting $\mathbf{x}_t := (s_t, a_t)$. Furthermore, we assume that the Q-function is parameterized by
$Q^\pi(\mathbf{x}; \mathbf{w}) = \boldsymbol{\phi}_m(\mathbf{x})^T \mathbf{w}$ and that $\mathbf{w}$ needs to be determined.

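The LSPE($\lambda$) method. The method $\lambda$-least squares policy evaluation LSPE($\lambda$) was
proposed by Nedić and Bertsekas (2003); Bertsekas et al. (2004) and proceeds by making
incremental changes to the weights $\mathbf{w}$. Assume that at time $t$ (after having observed $t$
transitions) we have a current weight vector $\mathbf{w}_t$ and observe a new transition from $\mathbf{x}_t$ to
$\mathbf{x}_{t+1}$ with associated reward $r_{t+1}$. Then we compute the solution $\hat{\mathbf{w}}_{t+1}$ of the least-squares
problem

$$ \hat{\mathbf{w}}_{t+1} = \operatorname{argmin}_{\mathbf{w}} \sum_{i=0}^{t} \Big\{ \boldsymbol{\phi}_m(\mathbf{x}_i)^T \mathbf{w} - \boldsymbol{\phi}_m(\mathbf{x}_i)^T \mathbf{w}_t - \sum_{k=i}^{t} (\lambda\gamma)^{k-i} d(\mathbf{x}_k, \mathbf{x}_{k+1}; \mathbf{w}_t) \Big\}^2 \tag{2} $$

where

$$ d(\mathbf{x}_k, \mathbf{x}_{k+1}; \mathbf{w}_t) := r_{k+1} + \gamma \boldsymbol{\phi}_m(\mathbf{x}_{k+1})^T \mathbf{w}_t - \boldsymbol{\phi}_m(\mathbf{x}_k)^T \mathbf{w}_t. $$

The new weight vector $\mathbf{w}_{t+1}$ is obtained by setting

$$ \mathbf{w}_{t+1} = \mathbf{w}_t + \eta_t (\hat{\mathbf{w}}_{t+1} - \mathbf{w}_t) \tag{3} $$

where $\mathbf{w}_0$ is the initial weight vector and $0 < \eta_t < 1$ is a diminishing step size.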

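The LSTD($\lambda$) method. The least-squares temporal difference method LSTD($\lambda$), proposed
by Bradtke and Barto (1996) for $\lambda = 0$ and by Boyan (1999) for general $\lambda \in [0, 1]$,
does not proceed by making incremental changes to the weight vector $\mathbf{w}$. Instead, at time
$t$ (after having observed $t$ transitions), the weight vector $\mathbf{w}_{t+1}$ is obtained by solving the
fixed-point equation

$$ \hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} \sum_{i=0}^{t} \Big\{ \boldsymbol{\phi}_m(\mathbf{x}_i)^T \mathbf{w} - \boldsymbol{\phi}_m(\mathbf{x}_i)^T \hat{\mathbf{w}} - \sum_{k=i}^{t} (\lambda\gamma)^{k-i} d(\mathbf{x}_k, \mathbf{x}_{k+1}; \hat{\mathbf{w}}) \Big\}^2 \tag{4} $$

for $\hat{\mathbf{w}}$, where

$$ d(\mathbf{x}_k, \mathbf{x}_{k+1}; \hat{\mathbf{w}}) := r_{k+1} + \gamma \boldsymbol{\phi}_m(\mathbf{x}_{k+1})^T \hat{\mathbf{w}} - \boldsymbol{\phi}_m(\mathbf{x}_k)^T \hat{\mathbf{w}}, $$

and setting $\mathbf{w}_{t+1}$ to this unique solution.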



1. If we are dealing with an episodic learning task with designated terminal states, we can generate an
infinite trajectory in the following way: once an episode ends, we set the discount factor $\gamma$ to zero and
make a zero-reward transition from the terminal state to the start state of the next (following) episode.

                             BRM                   LSTD                  LSPE
Corresponds to               TD(0)                 TD($\lambda$)         TD($\lambda$)
Transitions                  deterministic only    stochastic possible   stochastic possible
OPI                          no                    no                    possible
Least-squares                explicit              only implicitly       explicit
Supervised basis selection   yes                   no                    yes

Table 1: Comparison of least-squares policy evaluation



Comparison of LSPE and LSTD. The similarities and differences between LSPE($\lambda$)
and LSTD($\lambda$) are listed in Table 1. Both LSPE($\lambda$) and LSTD($\lambda$) converge to the same
limit (see Bertsekas et al., 2004), which is also the limit to which TD($\lambda$) converges (the
initial iterates may be vastly different though). Both methods rely on the solution of a
least-squares problem (either explicitly as is the case in LSPE or implicitly as is the case in
LSTD) and can be efficiently implemented using recursive computations. Computational
experiments in (Bertsekas and Ioffe, 1996) or (Lagoudakis and Parr, 2003) indicate that
both approaches can perform much better than TD($\lambda$).

Both methods LSPE and LSTD differ as far as their role in the approximate policy 
iteration framework is concerned. LSPE can take advantage of previous estimates of the 
weight vector and can hence be used in the context of optimistic policy iteration (OPI), i.e. 
the policy under consideration gets improved following very few observed transitions. For 
LSTD this is not possible; here a more rigid actor-critic approach is called for. 

Both methods LSPE and LSTD also differ as far as their relation to standard regression 
with least-squares methods is concerned. LSPE directly minimizes a quadratic objective 
function. Using this function it will be possible to carry out 'supervised' basis selection, 
where for the selection of basis functions the reduction of the costs (the quantity we are 
trying to minimize) is taken into account. For LSTD this is not possible; here in fact we 
are solving a fixed point equation that employs least-squares only implicitly (to carry out 
the projection). 

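The BRM method. A third approach, related to LSTD(0), is the direct minimization
of the Bellman residuals (BRM), as proposed in (Baird, 1995; Lagoudakis and Parr, 2003).
Here, at time $t$, the weight vector $\mathbf{w}_{t+1}$ is obtained from solving the least-squares problem

$$ \mathbf{w}_{t+1} = \operatorname{argmin}_{\mathbf{w}} \sum_{i=0}^{t} \Big\{ \boldsymbol{\phi}_m(s_i, \pi(s_i))^T \mathbf{w} - \sum_{s'} P(s' \mid s_i, \pi(s_i)) \big[ R(s' \mid s_i, \pi(s_i)) + \gamma \boldsymbol{\phi}_m(s', \pi(s'))^T \mathbf{w} \big] \Big\}^2. $$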

Unfortunately, the transition probabilities cannot be approximated by using single samples
from the trajectory; one would need 'doubled' samples to obtain an unbiased estimate
(see Baird, 1995). Thus this method would be only applicable for tasks with deterministic
state transitions or known state dynamics; two conditions which are both violated in our
application to RoboCup-keepaway. Nevertheless we will first treat the deterministic case
in all our derivations, since LSPE and LSTD require only very minor changes
to the resulting implementation. Using BRM with deterministic transitions amounts to
solving the least-squares problem

$$ \mathbf{w}_{t+1} = \operatorname{argmin}_{\mathbf{w}} \sum_{i=0}^{t} \Big\{ \boldsymbol{\phi}_m(\mathbf{x}_i)^T \mathbf{w} - r_{i+1} - \gamma \boldsymbol{\phi}_m(\mathbf{x}_{i+1})^T \mathbf{w} \Big\}^2 \tag{5} $$



2.2 Standard regression with regularization networks 

From the foregoing discussion we have seen that (approximate) policy evaluation can amount
to a traditional function approximation problem. For this purpose we will here consider the
family of regularization networks (Girosi et al., 1995), which are functionally equivalent to
kernel ridge regression and Bayesian regression with Gaussian processes (Rasmussen and
Williams, 2006). Here however, we will introduce them from the non-Bayesian regularization
perspective as in (Smola and Schölkopf, 2000).



2.2.1 Solving the full problem 

Given $t$ training examples $\{\mathbf{x}_i, y_i\}_{i=1}^t$ with inputs $\mathbf{x}_i$ and observed outputs $y_i$, to reconstruct
the underlying function, one considers candidates from a function space $\mathcal{H}_k$, where $\mathcal{H}_k$ is a
reproducing kernel Hilbert space with reproducing kernel $k$ (e.g. Wahba, 1990), and searches
among all possible candidates for the function $f \in \mathcal{H}_k$ that achieves the minimum in the risk
functional $\sum_{i=1}^t (y_i - f(\mathbf{x}_i))^2 + \sigma^2 \|f\|^2_{\mathcal{H}_k}$. The scalar $\sigma^2$ is a regularization parameter. Since
solutions to this variational problem may be represented through the data alone (Wahba,
1990) as $f(\cdot) = \sum_{i=1}^t k(\mathbf{x}_i, \cdot) w_i$, the unknown weight vector $\mathbf{w}$ is obtained from solving the
quadratic problem

$$ \min_{\mathbf{w} \in \mathbb{R}^t} \; (\mathbf{K}\mathbf{w} - \mathbf{y})^T (\mathbf{K}\mathbf{w} - \mathbf{y}) + \sigma^2 \mathbf{w}^T \mathbf{K} \mathbf{w} \tag{6} $$

The solution to (6) is $\mathbf{w} = (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{y}$, where $\mathbf{y} = (y_1, \ldots, y_t)^T$ and $\mathbf{K}$ is the $t \times t$ kernel
matrix $[\mathbf{K}]_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$.
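
As a minimal sketch of the full problem, the weight vector of a regularization network can be computed with a single linear solve. The Gaussian kernel, its `width` parameter and the function names below are assumptions made for illustration; the text above does not commit to a particular kernel.

```python
import numpy as np

def gaussian_kernel(A, B, width=1.0):
    """Kernel matrix [k(a_i, b_j)] for a Gaussian kernel (an assumed choice)."""
    sq = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * width**2))

def fit_regularization_network(X, y, sigma2, width=1.0):
    """Full solution of the quadratic problem: w = (K + sigma^2 I)^{-1} y."""
    K = gaussian_kernel(X, X, width)
    return np.linalg.solve(K + sigma2 * np.eye(len(X)), y)

def predict(X_train, w, X_test, width=1.0):
    """Evaluate f(x) = sum_i k(x_i, x) w_i at the test inputs."""
    return gaussian_kernel(X_test, X_train, width) @ w
```

The solve costs on the order of $t^3$ operations and every prediction touches all $t$ training inputs, which is what motivates the approximation in the next subsection.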



2.2.2 Subset of regressor approximation 

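Often, one is not willing to solve the full $t$-by-$t$ problem in (6) when the number of training
examples $t$ is large and instead considers means of approximation. In the subset of
regressors (SR) approach (Poggio and Girosi, 1990; Luo and Wahba, 1997; Smola and Schölkopf,
2000) one chooses a subset $\{\tilde{\mathbf{x}}_i\}_{i=1}^m$ of the data, with $m \ll t$, and approximates the kernel
for arbitrary $\mathbf{x}, \mathbf{x}'$ by taking

$$ k(\mathbf{x}, \mathbf{x}') \approx \mathbf{k}_m(\mathbf{x})^T \mathbf{K}_{mm}^{-1} \mathbf{k}_m(\mathbf{x}'). \tag{7} $$

Here $\mathbf{k}_m(\mathbf{x})$ denotes the $m \times 1$ feature vector $\mathbf{k}_m(\mathbf{x}) = (k(\tilde{\mathbf{x}}_1, \mathbf{x}), \ldots, k(\tilde{\mathbf{x}}_m, \mathbf{x}))^T$ and the
$m \times m$ matrix $\mathbf{K}_{mm}$ is the submatrix $[\mathbf{K}_{mm}]_{ij} = k(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j)$ of the full kernel matrix $\mathbf{K}$.
Replacing the kernel in (6) by expression (7) gives

$$ \min_{\mathbf{w} \in \mathbb{R}^m} \; (\mathbf{K}_{tm}\mathbf{w} - \mathbf{y})^T (\mathbf{K}_{tm}\mathbf{w} - \mathbf{y}) + \sigma^2 \mathbf{w}^T \mathbf{K}_{mm} \mathbf{w} $$

with solution

$$ \mathbf{w}_t = (\mathbf{K}_{tm}^T \mathbf{K}_{tm} + \sigma^2 \mathbf{K}_{mm})^{-1} \mathbf{K}_{tm}^T \mathbf{y} \tag{8} $$

where $\mathbf{K}_{tm}$ is the $t \times m$ submatrix $[\mathbf{K}_{tm}]_{ij} = k(\mathbf{x}_i, \tilde{\mathbf{x}}_j)$ corresponding to the $m$ columns of the
data points in the subset. Learning the weight vector $\mathbf{w}_t$ from (8) costs $O(tm^2)$ operations.
Afterwards, predictions for unknown test points $\mathbf{x}^*$ are made by $f(\mathbf{x}^*) = \mathbf{k}_m(\mathbf{x}^*)^T \mathbf{w}_t$ at
$O(m)$ operations.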
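
The subset of regressors solution can be sketched in the same way. Again the Gaussian kernel and the names `gaussian_kernel`, `fit_subset_of_regressors` and `X_sub` are illustrative assumptions; how the subset is chosen is left open here.

```python
import numpy as np

def gaussian_kernel(A, B, width=1.0):
    """Kernel matrix [k(a_i, b_j)] for a Gaussian kernel (an assumed choice)."""
    sq = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * width**2))

def fit_subset_of_regressors(X, y, X_sub, sigma2, width=1.0):
    """Weights w = (K_tm^T K_tm + sigma^2 K_mm)^{-1} K_tm^T y for a given subset."""
    K_tm = gaussian_kernel(X, X_sub, width)      # t x m
    K_mm = gaussian_kernel(X_sub, X_sub, width)  # m x m
    return np.linalg.solve(K_tm.T @ K_tm + sigma2 * K_mm, K_tm.T @ y)

def predict_sr(X_sub, w, X_test, width=1.0):
    """Prediction f(x*) = k_m(x*)^T w, costing O(m) per test point."""
    return gaussian_kernel(X_test, X_sub, width) @ w
```

When the subset is the whole training set and the kernel matrix is invertible, this reduces to the full solution of (6), which gives a simple sanity check.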

2.2.3 Online selection of the subset 

To choose the subset of relevant basis functions (termed the dictionary or set of basis vectors
BV) many different approaches are possible; typically they can be distinguished as being
unsupervised or supervised. Unsupervised approaches like random selection (Williams and Seeger,
2001) or the incomplete Cholesky decomposition (Fine and Scheinberg, 2001) do not use
information about the task we want to solve, i.e. the response variable we wish to regress upon.
Random selection does not use any information at all whereas incomplete Cholesky aims at
reducing the error incurred from approximating the kernel matrix. Supervised choice of the
subset does take into account the response variable and usually proceeds by greedy forward
selection, using e.g. matching pursuit techniques (Smola and Bartlett, 2001).



However, none of these approaches are directly applicable for sequential learning, since
the complete set of basis function candidates must be known from the start. Instead, assume
that the data becomes available only sequentially at $t = 1, 2, \ldots$ and that only one pass over
the data set is possible, so that we cannot select the subset BV in advance. Working in
the context of Gaussian process regression, Csató and Opper (2001) and later Engel et al.
(2003) have proposed a sparse greedy online approximation: start from an empty set of BV
and examine at every time step $t$ if the new example needs to be included in BV or if it can
be processed without augmenting BV. The criterion they employ to make that decision is
an unsupervised one: at every time step $t$ compute for the new data point $\mathbf{x}_t$ the error

$$ \delta_t = k(\mathbf{x}_t, \mathbf{x}_t) - \mathbf{k}_m(\mathbf{x}_t)^T \mathbf{K}_{mm}^{-1} \mathbf{k}_m(\mathbf{x}_t) \tag{9} $$

incurred from approximating the new data point using the current BV. If $\delta_t$ exceeds a
given threshold then it is considered as sufficiently different and added to the dictionary
BV. Note that only the current number of elements in BV at a given time $t$ is considered,
the contribution from basis functions that will be added at a later time is ignored.
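
The unsupervised test in (9) can be sketched as a single pass over the data. This is a simplified illustration, not the implementation used in this paper: the Gaussian kernel, the threshold name `tol`, starting the dictionary with the first point, and the function names are all assumptions. The inverse of $\mathbf{K}_{mm}$ is grown with the standard partitioned-inverse formula whenever a point is admitted.

```python
import numpy as np

def kernel(a, b, width=1.0):
    """Gaussian kernel between two points (an assumed choice)."""
    return float(np.exp(-np.sum((a - b) ** 2) / (2.0 * width**2)))

def select_dictionary(X, tol, width=1.0):
    """One-pass unsupervised selection of BV using the error in (9)."""
    bv = [0]  # sketch assumption: the first point always enters BV
    K_inv = np.array([[1.0 / kernel(X[0], X[0], width)]])
    for t in range(1, len(X)):
        k_vec = np.array([kernel(X[j], X[t], width) for j in bv])
        a = K_inv @ k_vec
        delta = kernel(X[t], X[t], width) - k_vec @ a  # error from (9)
        if delta > tol:
            # Grow K_mm^{-1} by one row and column (partitioned inverse).
            m = len(bv)
            grown = np.zeros((m + 1, m + 1))
            grown[:m, :m] = K_inv
            v = np.append(-a, 1.0)
            K_inv = grown + np.outer(v, v) / delta
            bv.append(t)
    return bv, K_inv
```

A point that duplicates an element already in BV yields an error of (numerically) zero and is never admitted, so the dictionary size tracks the region of the input space actually visited rather than the number of samples.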

In this case, it might be instructive to visualize what happens to the $t \times m$ data matrix
$\mathbf{K}_{tm}$ once BV is augmented. Adding the new element $\mathbf{x}_t$ to BV means adding a new basis
function (centered on $\mathbf{x}_t$) to the model and consequently adding a new associated column
$\mathbf{q} = (k(\mathbf{x}_1, \mathbf{x}_t), \ldots, k(\mathbf{x}_t, \mathbf{x}_t))^T$ to $\mathbf{K}_{tm}$. With sparse online approximation all $t-1$ past
entries in $\mathbf{q}$ are given by $k(\mathbf{x}_i, \mathbf{x}_t) \approx \mathbf{k}_m(\mathbf{x}_i)^T \mathbf{K}_{mm}^{-1} \mathbf{k}_m(\mathbf{x}_t)$, $i = 1, \ldots, t-1$, which is exact for
the $m$ basis-elements and an approximation for the remaining $t-m-1$ non-basis elements.
Hence, going from $m$ to $m+1$ basis functions, we have that

$$ \mathbf{K}_{t,m+1} = \begin{bmatrix} \mathbf{K}_{tm} & \mathbf{q} \end{bmatrix} = \begin{bmatrix} \mathbf{K}_{t-1,m} & \mathbf{K}_{t-1,m}\mathbf{a}_t \\ \mathbf{k}_m(\mathbf{x}_t)^T & k(\mathbf{x}_t, \mathbf{x}_t) \end{bmatrix} \tag{10} $$

where $\mathbf{a}_t := \mathbf{K}_{mm}^{-1} \mathbf{k}_m(\mathbf{x}_t)$. The overall effect is that now we do not need to access the
full data set any longer. All costly $O(tm)$ operations that arise from adding a new column,
i.e. adding a new basis function, computing the reduction of error during greedy
forward selection of basis functions, or computing predictive variance with augmentation as
in Rasmussen and Quiñonero-Candela (2005), now become a more affordable $O(m^2)$.

This is exploited in (Jung and Polani, 2006); here a simple modification of the selection
procedure is presented, where in addition to the unsupervised criterion from (9) the
contribution to the reduction of the error (i.e. the objective function one is trying to minimize)
is taken into account. Since the per-step complexity during training and then later during
prediction critically depends on the size $m$ of the subset BV, making sure that no unnecessary
basis functions are selected ensures more efficient usage of otherwise scarce resources
and makes learning in real-time (a necessity for keepaway) possible.



3. Policy evaluation with regularization networks 



We now present an efficient online implementation for least-squares-based policy evaluation
(applicable to the methods LSPE, LSTD, BRM) to be used in the framework of approximate
policy iteration (see Figure 1). Our implementation combines the aforementioned automatic
selection of basis functions (from Section 2.2.3) with a recursive computation of the weight
vector corresponding to the regularization network (from Section 2.2.2) to represent the
underlying Q-function. The goal is to infer an approximation $Q(\cdot\,; \mathbf{w})$ of $Q^\pi$, the unknown
Q-function of some given policy $\pi$. The training examples are taken from an observed
trajectory $\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots$ with associated rewards $r_1, r_2, \ldots$ where $\mathbf{x}_i$ denotes state-action
tuples $\mathbf{x}_i := (s_i, a_i)$ and action $a_i = \pi(s_i)$ is selected according to policy $\pi$.



3.1 Stating LSPE, LSTD and BRM with regularization networks 

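First, express each of the three problems LSPE in eq. (2), LSTD in eq. (4) and BRM in
eq. (5) in more compact matrix form using regularization networks from (8). Assume that
the dictionary BV contains $m$ basis functions. Further assume that at time $t$ (after having
observed $t$ transitions) a new transition from $\mathbf{x}_t$ to $\mathbf{x}_{t+1}$ under reward $r_{t+1}$ is observed.
From now on we will use a double index (also for vectors) to indicate the dependence on the
number of examples $t$ and the number of basis functions $m$. Define the matrices:

$$ \mathbf{K}_{t+1,m} := \begin{bmatrix} \mathbf{k}_m(\mathbf{x}_0)^T \\ \vdots \\ \mathbf{k}_m(\mathbf{x}_t)^T \end{bmatrix}, \qquad \mathbf{r}_{t+1} := \begin{bmatrix} r_1 \\ \vdots \\ r_{t+1} \end{bmatrix}, $$

$$ \mathbf{H}_{t+1,m} := \begin{bmatrix} \mathbf{k}_m(\mathbf{x}_0)^T - \gamma \mathbf{k}_m(\mathbf{x}_1)^T \\ \vdots \\ \mathbf{k}_m(\mathbf{x}_t)^T - \gamma \mathbf{k}_m(\mathbf{x}_{t+1})^T \end{bmatrix}, \qquad \boldsymbol{\Lambda}_{t+1} := \begin{bmatrix} 1 & (\lambda\gamma)^1 & \cdots & (\lambda\gamma)^t \\ 0 & \ddots & \ddots & \vdots \\ \vdots & & 1 & (\lambda\gamma)^1 \\ 0 & \cdots & 0 & 1 \end{bmatrix} \tag{11} $$

where, as before, the $m \times 1$ vector $\mathbf{k}_m(\cdot)$ denotes $\mathbf{k}_m(\cdot) = (k(\cdot, \tilde{\mathbf{x}}_1), \ldots, k(\cdot, \tilde{\mathbf{x}}_m))^T$.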
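
To make the role of the matrices in (11) concrete, the following sketch builds them from a short trajectory and checks that the product $\boldsymbol{\Lambda}^T \mathbf{K}$ has the familiar eligibility-trace structure, with row $i$ equal to $\sum_{j \le i} (\lambda\gamma)^{i-j} \mathbf{k}_m(\mathbf{x}_j)^T$. The input layout (an array `features` whose rows are $\mathbf{k}_m(\mathbf{x}_0), \ldots, \mathbf{k}_m(\mathbf{x}_{t+1})$) and the function names are assumptions made for this example.

```python
import numpy as np

def build_matrices(features, rewards, gamma, lam):
    """Build K, H, Lambda and r from a trajectory of t+1 transitions.

    features: (t+2, m) array, rows k_m(x_0), ..., k_m(x_{t+1}) (assumed layout)
    rewards:  (t+1,) array, r_1, ..., r_{t+1}
    """
    K = features[:-1]
    H = features[:-1] - gamma * features[1:]
    idx = np.arange(len(K))
    exponent = np.clip(idx[None, :] - idx[:, None], 0, None)
    # Upper triangular: entry (i, k) is (lam * gamma)^(k - i) for k >= i.
    Lam = np.triu((lam * gamma) ** exponent)
    return K, H, Lam, np.asarray(rewards, dtype=float)

def eligibility_traces(K, gamma, lam):
    """Rows z_i = lam * gamma * z_{i-1} + k_m(x_i), computed recursively."""
    Z = np.zeros_like(K)
    z = np.zeros(K.shape[1])
    for i, k_row in enumerate(K):
        z = lam * gamma * z + k_row
        Z[i] = z
    return Z
```

For any such trajectory `Lam.T @ K` agrees with `eligibility_traces(K, gamma, lam)`, so the $t \times t$ matrix never has to be formed explicitly in a recursive implementation.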
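3.1.1 The LSPE($\lambda$) method

Then, for LSPE($\lambda$), the least-squares problem (2) is stated as ($\mathbf{w}_{tm}$ being the weight vector
of the previous step):

$$ \hat{\mathbf{w}}_{t+1,m} = \operatorname{argmin}_{\mathbf{w}} \Big\{ \big\| \mathbf{K}_{t+1,m}\mathbf{w} - \mathbf{K}_{t+1,m}\mathbf{w}_{tm} - \boldsymbol{\Lambda}_{t+1} (\mathbf{r}_{t+1} - \mathbf{H}_{t+1,m}\mathbf{w}_{tm}) \big\|^2 + \sigma^2 (\mathbf{w} - \mathbf{w}_{tm})^T \mathbf{K}_{mm} (\mathbf{w} - \mathbf{w}_{tm}) \Big\} $$

Computing the derivative with respect to $\mathbf{w}$ and setting it to zero, one obtains for $\hat{\mathbf{w}}_{t+1,m}$:

$$ \hat{\mathbf{w}}_{t+1,m} = \mathbf{w}_{tm} + (\mathbf{K}_{t+1,m}^T \mathbf{K}_{t+1,m} + \sigma^2 \mathbf{K}_{mm})^{-1} (\mathbf{Z}_{t+1,m}^T \mathbf{r}_{t+1} - \mathbf{Z}_{t+1,m}^T \mathbf{H}_{t+1,m} \mathbf{w}_{tm}) $$

where in the last line we have substituted $\mathbf{Z}_{t+1,m} := \boldsymbol{\Lambda}_{t+1}^T \mathbf{K}_{t+1,m}$. From (3) the next iterate
$\mathbf{w}_{t+1,m}$ for the weight vector in LSPE($\lambda$) is thus obtained by

$$ \mathbf{w}_{t+1,m} = \mathbf{w}_{tm} + \eta_t (\hat{\mathbf{w}}_{t+1,m} - \mathbf{w}_{tm}) = \mathbf{w}_{tm} + \eta_t (\mathbf{K}_{t+1,m}^T \mathbf{K}_{t+1,m} + \sigma^2 \mathbf{K}_{mm})^{-1} (\mathbf{Z}_{t+1,m}^T \mathbf{r}_{t+1} - \mathbf{Z}_{t+1,m}^T \mathbf{H}_{t+1,m} \mathbf{w}_{tm}) \tag{12} $$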

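3.1.2 The LSTD($\lambda$) method

Likewise, for LSTD($\lambda$), the fixed point equation is stated as:

$$ \hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} \Big\{ \big\| \mathbf{K}_{t+1,m}\mathbf{w} - \mathbf{K}_{t+1,m}\hat{\mathbf{w}} - \boldsymbol{\Lambda}_{t+1} (\mathbf{r}_{t+1} - \mathbf{H}_{t+1,m}\hat{\mathbf{w}}) \big\|^2 + \sigma^2 \mathbf{w}^T \mathbf{K}_{mm} \mathbf{w} \Big\}. $$

Computing the derivative with respect to $\mathbf{w}$ and setting it to zero, one obtains

$$ (\mathbf{Z}_{t+1,m}^T \mathbf{H}_{t+1,m} + \sigma^2 \mathbf{K}_{mm}) \hat{\mathbf{w}} = \mathbf{Z}_{t+1,m}^T \mathbf{r}_{t+1}. $$

Thus the solution $\mathbf{w}_{t+1,m}$ to the fixed point equation in LSTD($\lambda$) is given by:

$$ \mathbf{w}_{t+1,m} = (\mathbf{Z}_{t+1,m}^T \mathbf{H}_{t+1,m} + \sigma^2 \mathbf{K}_{mm})^{-1} \mathbf{Z}_{t+1,m}^T \mathbf{r}_{t+1} \tag{13} $$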

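3.1.3 The BRM method

Finally, for the case of BRM, the least-squares problem (5) is stated as:

$$ \mathbf{w}_{t+1,m} = \operatorname{argmin}_{\mathbf{w}} \Big\{ \| \mathbf{r}_{t+1} - \mathbf{H}_{t+1,m}\mathbf{w} \|^2 + \sigma^2 \mathbf{w}^T \mathbf{K}_{mm} \mathbf{w} \Big\} $$

Thus again, one obtains the weight vector $\mathbf{w}_{t+1,m}$ by

$$ \mathbf{w}_{t+1,m} = (\mathbf{H}_{t+1,m}^T \mathbf{H}_{t+1,m} + \sigma^2 \mathbf{K}_{mm})^{-1} \mathbf{H}_{t+1,m}^T \mathbf{r}_{t+1} \tag{14} $$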
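
Before turning to the recursive implementation, the three closed-form expressions (12), (13) and (14) can be written down directly as batch computations. This is only a reference sketch for checking a recursive implementation against; the function names and the convention of passing the matrices from (11) as NumPy arrays are assumptions.

```python
import numpy as np

def solve_brm(H, r, K_mm, sigma2):
    """Batch BRM weights: (H^T H + sigma^2 K_mm)^{-1} H^T r, as in (14)."""
    return np.linalg.solve(H.T @ H + sigma2 * K_mm, H.T @ r)

def solve_lstd(K, H, Lam, r, K_mm, sigma2):
    """Batch LSTD(lambda) weights: (Z^T H + sigma^2 K_mm)^{-1} Z^T r, as in (13)."""
    Z = Lam.T @ K
    return np.linalg.solve(Z.T @ H + sigma2 * K_mm, Z.T @ r)

def lspe_step(w, K, H, Lam, r, K_mm, sigma2, eta):
    """One LSPE(lambda) iterate starting from the previous weights w, as in (12)."""
    Z = Lam.T @ K
    P = K.T @ K + sigma2 * K_mm
    return w + eta * np.linalg.solve(P, Z.T @ r - Z.T @ H @ w)
```

Reading (12) and (13) side by side, an LSPE step leaves $\mathbf{w}$ unchanged exactly when $\mathbf{Z}^T \mathbf{r} = \mathbf{Z}^T \mathbf{H} \mathbf{w}$, which is the LSTD equation with $\sigma^2 = 0$. Note also that the LSTD system matrix $\mathbf{Z}^T \mathbf{H}$ is not symmetric, whereas the other two are.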
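3.2 Outline of the recursive implementation

Note that all three methods amount to solving a very similarly structured set of linear
equations in eqs. (12), (13), (14). Overloading the notation, these can be compactly stated as:

LSPE: solve

$$ \mathbf{w}_{t+1,m} = \mathbf{w}_{tm} + \eta_t \mathbf{P}_{t+1,m}^{-1} (\mathbf{b}_{t+1,m} - \mathbf{A}_{t+1,m} \mathbf{w}_{tm}) \tag{12'} $$

where

- $\mathbf{P}_{t+1,m} := \mathbf{K}_{t+1,m}^T \mathbf{K}_{t+1,m} + \sigma^2 \mathbf{K}_{mm}$
- $\mathbf{b}_{t+1,m} := \mathbf{Z}_{t+1,m}^T \mathbf{r}_{t+1}$
- $\mathbf{A}_{t+1,m} := \mathbf{Z}_{t+1,m}^T \mathbf{H}_{t+1,m}$

LSTD: solve

$$ \mathbf{w}_{t+1,m} = \mathbf{P}_{t+1,m}^{-1} \mathbf{b}_{t+1,m} \tag{13'} $$

where

- $\mathbf{P}_{t+1,m} := \mathbf{Z}_{t+1,m}^T \mathbf{H}_{t+1,m} + \sigma^2 \mathbf{K}_{mm}$
- $\mathbf{b}_{t+1,m} := \mathbf{Z}_{t+1,m}^T \mathbf{r}_{t+1}$

BRM: solve

$$ \mathbf{w}_{t+1,m} = \mathbf{P}_{t+1,m}^{-1} \mathbf{b}_{t+1,m} \tag{14'} $$

where

- $\mathbf{P}_{t+1,m} := \mathbf{H}_{t+1,m}^T \mathbf{H}_{t+1,m} + \sigma^2 \mathbf{K}_{mm}$
- $\mathbf{b}_{t+1,m} := \mathbf{H}_{t+1,m}^T \mathbf{r}_{t+1}$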



Each time a new transition from $\mathbf{x}_t$ to $\mathbf{x}_{t+1}$ under reward $r_{t+1}$ is observed, the goal is to
recursively

1. update the weight vector $\mathbf{w}_{tm}$, and

2. possibly augment the model, adding a new basis function (centered on $\mathbf{x}_{t+1}$) to the
set of currently selected basis functions BV.

More specifically, we will perform one or both of the following update operations:

1. Normal step: Process $(\mathbf{x}_t, \mathbf{x}_{t+1}, r_{t+1})$ using the current fixed set of basis functions BV.

2. Growing step: If the new example is sufficiently different from the previous examples
in BV (i.e. the reconstruction error in (9) exceeds a given threshold) and strongly
contributes to the solution of the problem (i.e. the decrease of the loss when adding
the new basis function is greater than a given threshold) then the current example is
added to BV and the number of basis functions in the model is increased by one.
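The update operations work along the lines of recursive least squares (RLS), i.e. propagate
forward the inverse$^2$ of the $m \times m$ cross product matrix $\mathbf{P}_{tm}$. Integral to the derivation of
these updates are two well-known matrix identities for recursively computing the inverse of
a matrix (for matrices with compatible dimensions):

$$ \text{if } \mathbf{B}_{t+1} = \mathbf{B}_t + \mathbf{b}\mathbf{b}^T \text{ then } \mathbf{B}_{t+1}^{-1} = \mathbf{B}_t^{-1} - \frac{\mathbf{B}_t^{-1} \mathbf{b}\mathbf{b}^T \mathbf{B}_t^{-1}}{1 + \mathbf{b}^T \mathbf{B}_t^{-1} \mathbf{b}} \tag{15} $$

which is used when adding a row to the data matrix. Likewise,

$$ \text{if } \mathbf{B}_{t+1} = \begin{bmatrix} \mathbf{B}_t & \mathbf{b} \\ \mathbf{b}^T & b^* \end{bmatrix} \text{ then } \mathbf{B}_{t+1}^{-1} = \begin{bmatrix} \mathbf{B}_t^{-1} & \mathbf{0} \\ \mathbf{0}^T & 0 \end{bmatrix} + \frac{1}{\Delta_b} \begin{bmatrix} -\mathbf{B}_t^{-1}\mathbf{b} \\ 1 \end{bmatrix} \begin{bmatrix} -\mathbf{B}_t^{-1}\mathbf{b} \\ 1 \end{bmatrix}^T \tag{16} $$

with $\Delta_b = b^* - \mathbf{b}^T \mathbf{B}_t^{-1} \mathbf{b}$. This second update is used when adding a column to the data
matrix.

An outline of the general implementation applicable to all three of the methods LSPE,
LSTD, and BRM is sketched in Figure 2. To avoid unnecessary repetitions we will here
only derive the update equations for the BRM method; the other two are obtained with
very minor modifications and are summarized in the appendix.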
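
Both identities are easy to check numerically. The sketch below implements (15) and (16) as stated, assuming a symmetric matrix $\mathbf{B}_t$ (as is the case for BRM and LSPE); the function names are illustrative.

```python
import numpy as np

def rank_one_inverse_update(B_inv, b):
    """Inverse of B + b b^T from the inverse of B, as in identity (15)."""
    Bb = B_inv @ b
    return B_inv - np.outer(Bb, b @ B_inv) / (1.0 + b @ Bb)

def bordered_inverse_update(B_inv, b, b_star):
    """Inverse of [[B, b], [b^T, b_star]] from the inverse of a symmetric B,
    as in identity (16)."""
    Bb = B_inv @ b
    delta = b_star - b @ Bb          # Schur complement, Delta_b in the text
    m = len(b)
    out = np.zeros((m + 1, m + 1))
    out[:m, :m] = B_inv
    v = np.append(-Bb, 1.0)
    return out + np.outer(v, v) / delta
```

Each update costs $O(m^2)$ operations, compared with $O(m^3)$ for inverting the modified matrix from scratch.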



3.3 Derivation of recursive updates for the case BRM 

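Let $t$ be the current time step, $(\mathbf{x}_{t+1}, r_{t+1})$ the currently observed input-output pair and
assume that from the past $t$ examples $\{(\mathbf{x}_i, r_i)\}_{i=1}^t$ the $m$ examples $\{\tilde{\mathbf{x}}_i\}_{i=1}^m$ were selected
into the dictionary BV. Consider the penalized least-squares problem that is BRM (restated
here for clarity)

$$ \min_{\mathbf{w} \in \mathbb{R}^m} J_{tm}(\mathbf{w}) = \| \mathbf{r}_t - \mathbf{H}_{tm}\mathbf{w} \|^2 + \sigma^2 \mathbf{w}^T \mathbf{K}_{mm} \mathbf{w} \tag{17} $$

with $\mathbf{H}_{tm}$ being the $t \times m$ data matrix and $\mathbf{r}_t$ being the $t \times 1$ vector of the observed output
values from (11). Defining the $m \times m$ cross product matrix $\mathbf{P}_{tm} = (\mathbf{H}_{tm}^T \mathbf{H}_{tm} + \sigma^2 \mathbf{K}_{mm})$,
the solution to (17) is given by

$$ \mathbf{w}_{tm} = \mathbf{P}_{tm}^{-1} \mathbf{H}_{tm}^T \mathbf{r}_t. $$

Finally, introduce the costs $\xi_{tm} = J_{tm}(\mathbf{w}_{tm})$. Assuming that $\{\mathbf{P}_{tm}^{-1}, \mathbf{w}_{tm}, \xi_{tm}\}$ are known
from previous computations, every time a new transition $(\mathbf{x}_{t+1}, r_{t+1})$ is observed, we will
perform one or both of the following update operations: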

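3.3.1 Normal step: from $\{\mathbf{P}_{tm}^{-1}, \mathbf{w}_{tm}, \xi_{tm}\}$ to $\{\mathbf{P}_{t+1,m}^{-1}, \mathbf{w}_{t+1,m}, \xi_{t+1,m}\}$

With $\mathbf{h}_{t+1}$ defined as the $m \times 1$ vector $\mathbf{h}_{t+1} := \mathbf{k}_m(\mathbf{x}_t) - \gamma \mathbf{k}_m(\mathbf{x}_{t+1})$, one gets

$$ \mathbf{H}_{t+1,m} = \begin{bmatrix} \mathbf{H}_{tm} \\ \mathbf{h}_{t+1}^T \end{bmatrix} \quad \text{and} \quad \mathbf{r}_{t+1} = \begin{bmatrix} \mathbf{r}_t \\ r_{t+1} \end{bmatrix}. $$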

2. A better alternative (from the standpoint of numerical implementation) would be to not propagate
forward the inverse, but instead to work with the Cholesky factor. For this paper we chose the first
method because it gives consistent update formulas for all three considered problems
(note that for LSTD the cross-product matrix is not symmetric) and overall allows a better exposition.
For details on the second way, see e.g. (Sayed, 2003).

Relevant symbols:

    // $\pi$                     Policy, whose value function $Q^\pi$ we want to estimate
    // $t$                       Number of transitions seen so far
    // $m$                       Current number of basis functions in BV
    // $\mathbf{P}_{tm}^{-1}$    Cross product matrix used to compute $\mathbf{w}_{tm}$
    // $\mathbf{w}_{tm}$         Weights of $Q(\cdot\,; \mathbf{w}_{tm})$, the current approximation to $Q^\pi$
    // $\mathbf{K}_{mm}^{-1}$    Used during approximation of kernel

Initialization:

    Generate first state $s_0$. Choose action $a_0 = \pi(s_0)$. Execute $a_0$ and observe $s_1$ and $r_1$.
    Choose $a_1 = \pi(s_1)$. Let $\mathbf{x}_0 := (s_0, a_0)$ and $\mathbf{x}_1 := (s_1, a_1)$. Initialize the set of basis
    functions BV := $\{\mathbf{x}_0, \mathbf{x}_1\}$ and $\mathbf{K}_{2,2}^{-1}$. Initialize $\mathbf{P}_{1,2}^{-1}$, $\mathbf{w}_{1,2}$ according to either LSPE,
    LSTD or BRM. Set $t := 1$ and $m := 2$.

Loop: For $t = 1, 2, \ldots$

    Execute action $a_t$ (simulate a transition).
    Observe next state $s_{t+1}$ and reward $r_{t+1}$.
    Choose action $a_{t+1} = \pi(s_{t+1})$. Let $\mathbf{x}_{t+1} := (s_{t+1}, a_{t+1})$.

    Step 1: Check if $\mathbf{x}_{t+1}$ should be added to the set of basis functions.
        Unsupervised basis selection: return true if (9) > TOL1.
        Supervised basis selection: return true if (9) > TOL1
            and additionally if either (23) or (23') > TOL2.

    Step 2: Normal step
        Obtain $\mathbf{P}_{t+1,m}^{-1}$ from either (18), (18'), or (18'').
        Obtain $\mathbf{w}_{t+1,m}$ from either (19), (19'), or (19'').

    Step 3: Growing step (only when step 1 returned true)
        Obtain $\mathbf{P}_{t+1,m+1}^{-1}$ from either (20), (20'), or (20'').
        Obtain $\mathbf{w}_{t+1,m+1}$ from either (21), (21'), or (21'').
        Add $\mathbf{x}_{t+1}$ to BV and obtain $\mathbf{K}_{m+1,m+1}^{-1}$ from (25).
        $m := m + 1$

    $t := t + 1$, $s_t := s_{t+1}$, $a_t := a_{t+1}$

Figure 2: Online policy evaluation with growing regularization networks. This pseudo-code
applies to BRM, LSPE and LSTD, see the appendix for the exact equations. The
computational complexity per observed transition is $O(m^2)$.



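Thus $\mathbf{P}_{t+1,m} = \mathbf{P}_{tm} + \mathbf{h}_{t+1}\mathbf{h}_{t+1}^T$ and we obtain from (15) the well-known RLS updates

$$\mathbf{P}_{t+1,m}^{-1} = \mathbf{P}_{tm}^{-1} - \frac{\mathbf{P}_{tm}^{-1}\mathbf{h}_{t+1}\mathbf{h}_{t+1}^T\mathbf{P}_{tm}^{-1}}{\Delta} \tag{18}$$

with scalar $\Delta = 1 + \mathbf{h}_{t+1}^T \mathbf{P}_{tm}^{-1} \mathbf{h}_{t+1}$ and

$$\mathbf{w}_{t+1,m} = \mathbf{w}_{tm} + \frac{\varrho}{\Delta}\mathbf{P}_{tm}^{-1}\mathbf{h}_{t+1} \tag{19}$$

with scalar $\varrho = r_{t+1} - \mathbf{h}_{t+1}^T \mathbf{w}_{tm}$. The costs become $\xi_{t+1,m} = \xi_{tm} + \varrho^2/\Delta$. The set of basis
functions is not altered during this step. Operation complexity is $O(m^2)$.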
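
The normal step can be checked numerically against the batch solution of (17). The following is a minimal NumPy sketch, not the authors' implementation: the function names `normal_step` and `batch_solution`, the random test data, and the randomly generated positive definite stand-in for $\mathbf{K}_{mm}$ are all assumptions made for illustration.

```python
import numpy as np

def normal_step(P_inv, w, xi, h, r):
    """Sketch of updates (18)-(19): absorb one new row h and target r."""
    Ph = P_inv @ h                       # P_tm^{-1} h_{t+1}, reused three times
    Delta = 1.0 + h @ Ph                 # scalar Delta
    rho = r - h @ w                      # residual of the old weights on the new example
    P_inv_new = P_inv - np.outer(Ph, Ph) / Delta   # (18); relies on P_inv being symmetric
    w_new = w + (rho / Delta) * Ph                 # (19)
    xi_new = xi + rho**2 / Delta                   # cost update
    return P_inv_new, w_new, xi_new

def batch_solution(H, r, K, sigma2):
    """Direct solution of the penalized least-squares problem (17)."""
    P = H.T @ H + sigma2 * K
    w = np.linalg.solve(P, H.T @ r)
    xi = np.sum((r - H @ w) ** 2) + sigma2 * w @ K @ w
    return np.linalg.inv(P), w, xi

# Hypothetical test data: t examples, m basis functions.
rng = np.random.default_rng(0)
t, m, sigma2 = 8, 3, 0.1
H = rng.normal(size=(t, m))
r = rng.normal(size=t)
B = rng.normal(size=(m, m))
K = B @ B.T + np.eye(m)                  # stand-in for a kernel matrix (positive definite)

P_inv, w, xi = batch_solution(H, r, K, sigma2)
h_new, r_new = rng.normal(size=m), rng.normal()

rec = normal_step(P_inv, w, xi, h_new, r_new)
ref = batch_solution(np.vstack([H, h_new]), np.append(r, r_new), K, sigma2)
```

Each call to `normal_step` costs $O(m^2)$ because it only involves one matrix-vector product and one outer product, whereas `batch_solution` re-solves the full system.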
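3.3.2 Growing step: from $\{\mathbf{P}_{t+1,m}^{-1}, \mathbf{w}_{t+1,m}, \xi_{t+1,m}\}$ to $\{\mathbf{P}_{t+1,m+1}^{-1}, \mathbf{w}_{t+1,m+1}, \xi_{t+1,m+1}\}$

How to add a BV. When adding a basis function (centered on $x_{t+1}$) to the model, we
augment the set BV with $x_{m+1}$ (note that $x_{m+1}$ is the same as $x_{t+1}$ from above). Define
$\mathbf{k}_{t+1} := \mathbf{k}_m(x_{t+1})$, $k^*_t := k(x_t, x_{t+1})$, and $k^*_{t+1} := k(x_{t+1}, x_{t+1})$. Adding a basis function
means appending a new $(t+1) \times 1$ vector $\mathbf{q}$ to the data matrix and appending $\mathbf{k}_{t+1}$ as
row/column to the penalty matrix $\mathbf{K}_{mm}$, thus

$$\mathbf{P}_{t+1,m+1} = \begin{bmatrix} \mathbf{H}_{t+1,m} & \mathbf{q} \end{bmatrix}^T \begin{bmatrix} \mathbf{H}_{t+1,m} & \mathbf{q} \end{bmatrix} + \sigma^2 \begin{bmatrix} \mathbf{K}_{mm} & \mathbf{k}_{t+1} \\ \mathbf{k}_{t+1}^T & k^*_{t+1} \end{bmatrix}.$$

Invoking (16) we obtain the updated inverse $\mathbf{P}_{t+1,m+1}^{-1}$ via

$$\mathbf{P}_{t+1,m+1}^{-1} = \begin{bmatrix} \mathbf{P}_{t+1,m}^{-1} & \mathbf{0} \\ \mathbf{0}^T & 0 \end{bmatrix} + \frac{1}{\Delta_b} \begin{bmatrix} -\mathbf{w}_b \\ 1 \end{bmatrix} \begin{bmatrix} -\mathbf{w}_b \\ 1 \end{bmatrix}^T \tag{20}$$

where simple vector algebra reveals that

$$\mathbf{w}_b = \mathbf{P}_{t+1,m}^{-1}(\mathbf{H}_{t+1,m}^T \mathbf{q} + \sigma^2 \mathbf{k}_{t+1}), \qquad \Delta_b = \mathbf{q}^T\mathbf{q} + \sigma^2 k^*_{t+1} - (\mathbf{H}_{t+1,m}^T \mathbf{q} + \sigma^2 \mathbf{k}_{t+1})^T \mathbf{w}_b. \tag{21}$$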



Without sparse online approximation this step would require us to recall all $t$ past examples
and would come at the undesirable price of $O(tm)$ operations. However, we are going to
get away with merely $O(m)$ operations and only need to access the $m$ past examples in
the memorized BV. Due to the sparse online approximation, $\mathbf{q}$ is actually of the form

$$\mathbf{q}^T = \begin{bmatrix} (\mathbf{H}_{tm}\mathbf{a}_{t+1})^T & h^*_{t+1} \end{bmatrix}$$

with $h^*_{t+1} := k^*_t - \gamma k^*_{t+1}$ and $\mathbf{a}_{t+1} = \mathbf{K}_{mm}^{-1}\mathbf{k}_{t+1}$ (see the earlier discussion of the sparse online approximation).



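Hence new information is injected only through the last component. Exploiting this special
structure of $\mathbf{q}$, equation (21) becomes

$$\mathbf{w}_b = \mathbf{a}_{t+1} + \frac{\delta_h}{\Delta}\mathbf{P}_{tm}^{-1}\mathbf{h}_{t+1}, \qquad \Delta_b = \frac{\delta_h^2}{\Delta} + \sigma^2\delta \tag{22}$$

where $\delta_h = h^*_{t+1} - \mathbf{h}_{t+1}^T\mathbf{a}_{t+1}$ and $\delta = k^*_{t+1} - \mathbf{k}_{t+1}^T\mathbf{a}_{t+1}$. If we cache and reuse those terms already computed in the
preceding step (see Section 3.3.1) then we can obtain $\mathbf{w}_b$, $\Delta_b$ in $O(m)$ operations.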

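To obtain the updated coefficients $\mathbf{w}_{t+1,m+1}$ we postmultiply (20) by
$\mathbf{H}_{t+1,m+1}^T \mathbf{r}_{t+1} = \begin{bmatrix} (\mathbf{H}_{t+1,m}^T \mathbf{r}_{t+1})^T & \mathbf{q}^T\mathbf{r}_{t+1} \end{bmatrix}^T$, getting

$$\mathbf{w}_{t+1,m+1} = \begin{bmatrix} \mathbf{w}_{t+1,m} \\ 0 \end{bmatrix} + \kappa \begin{bmatrix} -\mathbf{w}_b \\ 1 \end{bmatrix} \tag{23}$$

where scalar $\kappa$ is defined by $\kappa = \mathbf{r}_{t+1}^T(\mathbf{q} - \mathbf{H}_{t+1,m}\mathbf{w}_b)/\Delta_b$. Again we can now exploit the
special structure of $\mathbf{q}$ to show that $\kappa$ is equal to

$$\kappa = \frac{\delta_h \varrho}{\Delta_b \Delta}.$$

And again we can reuse terms computed in the previous step (see Section 3.3.1).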

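Skipping the computations, we can show that the reduced (regularized) cost $\xi_{t+1,m+1}$ is
recursively obtained from $\xi_{t+1,m}$ via the expression:

$$\xi_{t+1,m+1} = \xi_{t+1,m} - \kappa^2 \Delta_b. \tag{24}$$

Finally, each time we add an example to the BV set we must also update the inverse kernel
matrix $\mathbf{K}_{mm}^{-1}$ needed during the computation of $\mathbf{a}_{t+1}$ and $\delta_h$. This can be done using the
formula for partitioned matrix inverses (16):

$$\mathbf{K}_{m+1,m+1}^{-1} = \begin{bmatrix} \mathbf{K}_{mm}^{-1} & \mathbf{0} \\ \mathbf{0}^T & 0 \end{bmatrix} + \frac{1}{\delta} \begin{bmatrix} -\mathbf{a}_{t+1} \\ 1 \end{bmatrix} \begin{bmatrix} -\mathbf{a}_{t+1} \\ 1 \end{bmatrix}^T \tag{25}$$

with $\delta = k^*_{t+1} - \mathbf{k}_{t+1}^T\mathbf{a}_{t+1}$.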
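
The growing step for BRM can likewise be checked against a direct solve of the enlarged problem. The sketch below is an illustration under stated assumptions rather than the authors' code: `pad_and_add` and `growing_step` are hypothetical names, the data are random, and the kernel matrix is a random positive definite stand-in. The cached quantities `Ph` ($\mathbf{P}_{tm}^{-1}\mathbf{h}_{t+1}$), `Delta` and `rho` are assumed to be left over from the preceding normal step.

```python
import numpy as np

def pad_and_add(M_inv, u, s):
    """Partitioned-inverse pattern: zero-pad M_inv, then add [-u; 1][-u; 1]^T / s."""
    m = M_inv.shape[0]
    out = np.zeros((m + 1, m + 1))
    out[:m, :m] = M_inv
    v = np.append(-u, 1.0)
    return out + np.outer(v, v) / s

def growing_step(P_inv, w, xi, K_inv, k_vec, k_star, h, h_star, Ph, Delta, rho, sigma2):
    """Sketch of the BRM growing step; P_inv, w, xi are the results of the normal step."""
    a = K_inv @ k_vec                      # a_{t+1}; in practice already known from step 1
    delta = k_star - k_vec @ a             # novelty term
    delta_h = h_star - h @ a
    w_b = a + (delta_h / Delta) * Ph       # structured form of w_b
    Delta_b = delta_h**2 / Delta + sigma2 * delta
    kappa = delta_h * rho / (Delta_b * Delta)
    P_inv_new = pad_and_add(P_inv, w_b, Delta_b)                # enlarged inverse
    w_new = np.append(w, 0.0) + kappa * np.append(-w_b, 1.0)    # enlarged weights
    xi_new = xi - kappa**2 * Delta_b                            # reduced cost
    K_inv_new = pad_and_add(K_inv, a, delta)                    # enlarged inverse kernel matrix
    return P_inv_new, w_new, xi_new, K_inv_new

# Hypothetical test data.
rng = np.random.default_rng(1)
t, m, sigma2 = 10, 4, 0.1
B = rng.normal(size=(m + 1, m + 1))
K_full = B @ B.T + np.eye(m + 1)           # stand-in kernel matrix on m+1 basis points
K, k_vec, k_star = K_full[:m, :m], K_full[:m, m], K_full[m, m]
K_inv = np.linalg.inv(K)
H_t, r_t = rng.normal(size=(t, m)), rng.normal(size=t)
h, h_star, r_new = rng.normal(size=m), rng.normal(), rng.normal()

# State before and after the normal step, computed directly for the check.
P_t_inv = np.linalg.inv(H_t.T @ H_t + sigma2 * K)
w_t = P_t_inv @ H_t.T @ r_t
Ph, Delta, rho = P_t_inv @ h, 1.0 + h @ P_t_inv @ h, r_new - h @ w_t
H1, r1 = np.vstack([H_t, h]), np.append(r_t, r_new)
P1_inv = np.linalg.inv(H1.T @ H1 + sigma2 * K)
w1 = P1_inv @ H1.T @ r1
xi1 = np.sum((r1 - H1 @ w1) ** 2) + sigma2 * w1 @ K @ w1

rec = growing_step(P1_inv, w1, xi1, K_inv, k_vec, k_star, h, h_star, Ph, Delta, rho, sigma2)

# Direct solution with the extra column q = [H_tm a; h*].
q = np.append(H_t @ (K_inv @ k_vec), h_star)
H2 = np.column_stack([H1, q])
P2 = H2.T @ H2 + sigma2 * K_full
w2 = np.linalg.solve(P2, H2.T @ r1)
xi2 = np.sum((r1 - H2 @ w2) ** 2) + sigma2 * w2 @ K_full @ w2
```

Apart from forming `a`, which the novelty test has typically produced already, the scalars and `w_b` need only $O(m)$ work; the remaining cost is the rank-one updates of the two inverses.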



When to add a BV. To decide whether or not the current example $x_{t+1}$ should be
added to the BV set, we employ the supervised two-part criterion from (Jung and Polani,
2006). The first part measures the 'novelty' of the current example: only examples that are
'far' from those already stored in the BV set are considered for inclusion. To this end we
compute as in (Csató and Opper, 2001) the squared norm of the residual from projecting
(in RKHS) the example onto the span of the current BV set, i.e. we compute, restated
from (9), $\delta = k^*_{t+1} - \mathbf{k}_{t+1}^T\mathbf{a}_{t+1}$. If $\delta <$ TOL1 for a given threshold TOL1, then $x_{t+1}$ is well
represented by the given BV set and its inclusion would not contribute much to reduce the
error from approximating the kernel by the reduced set. On the other hand, if $\delta >$ TOL1
then $x_{t+1}$ is not well represented by the current BV set and leaving it behind could incur a
large error in the approximation of the kernel.

Aside from novelty, we consider as second part of the selection criterion the 'usefulness'
of a basis function candidate. Usefulness is taken to be its contribution to the reduction of
the regularized costs $\xi_{tm}$, i.e. the term $\kappa^2\Delta_b$ from (24). Both parts together are combined
into one rule: only if $\delta >$ TOL1 and $\kappa^2\Delta_b >$ TOL2, then the current example will become a
new basis function and will be added to BV.
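
To make the two-part rule concrete, here is a small sketch of the novelty computation and the combined test. It assumes a Gaussian RBF kernel with $k(x,x) = 1$ and recomputes $\mathbf{K}_{mm}^{-1}\mathbf{k}_{t+1}$ by a direct solve for brevity; the function names `rbf`, `novelty` and `should_add`, the default thresholds, and the toy dictionary are illustrative assumptions.

```python
import numpy as np

def rbf(x, y, h=5.0):
    """Gaussian RBF exp(-h * ||x - y||^2)."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-h * np.dot(d, d)))

def novelty(BV, x, h=5.0):
    """delta = k(x, x) - k^T K^{-1} k for a candidate x and dictionary BV."""
    K = np.array([[rbf(u, v, h) for v in BV] for u in BV])
    k = np.array([rbf(u, x, h) for u in BV])
    a = np.linalg.solve(K, k)            # a = K^{-1} k (recursively maintained in the paper)
    return rbf(x, x, h) - k @ a

def should_add(delta, kappa, Delta_b, tol1=0.1, tol2=0.001):
    """Supervised rule: novel enough AND useful enough."""
    return bool(delta > tol1 and kappa**2 * Delta_b > tol2)

BV = np.array([[0.0, 0.0], [1.0, 0.0]])  # toy dictionary
delta_seen = novelty(BV, BV[0])          # a point already in BV
delta_far = novelty(BV, np.array([3.0, 3.0]))   # a point far from all of BV
```

Setting `tol2=0` in `should_add` reduces the rule to the novelty test alone whenever $\kappa \neq 0$, which corresponds to the unsupervised case.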



4. RoboCup-keepaway as RL benchmark 

The experimental work we carried out for this article uses the publicly available³ keepaway
framework from (Stone et al., 2005), which is built on top of the standard RoboCup soccer
simulator also used for official competitions (Noda et al., 1998). Agents in RoboCup are
autonomous entities; they sense and act independently and asynchronously, run as individual
processes and cannot communicate directly. Agents receive visual perceptions every 150
msec and may act once every 100 msec. The state description consists of relative distances
and angles to visible objects in the world, such as the ball, other agents or fixed beacons
for localization. In addition, random noise affects both the agents' sensors and their
actuators.

In keepaway, one team of 'keepers' must learn how to maximize the time they can
control the ball within a limited region of the field against an opposing team of 'takers'.
Only the keepers are allowed to learn; the behavior of the takers is governed by a fixed set
of hand-coded rules. However, each keeper only learns individually from its own (noisy)
actions and its own (noisy) perceptions of the world. The decision-making happens at an
intermediate level using multi-step macro-actions; the keeper currently controlling the ball
must decide between holding the ball or passing it to one of its teammates. The remaining
keepers automatically try to position themselves so as to best receive a pass. The task is
episodic; it starts with the keepers controlling the ball and continues as long as neither the
ball leaves the region nor the takers succeed in gaining control. Thus the goal for RL is to
maximize the overall duration of an episode. The immediate reward is the time that passes
between individual calls to the acting agent.

3. Sources are available from http://www.cs.utexas.edu/users/AustinVilla/sim/keepaway/

Figure 3: Illustrating keepaway. The various lines and angles indicate the 13 state variables
making up each sensation as provided by the keepaway benchmark software.

For our work, we consider as in (Stone et al., 2005) the special 3vs2 keepaway problem
(i.e. three learning keepers against two takers) played in a 20 m x 20 m field. In this case the
continuous state space has dimensionality 13, and the discrete action space consists of the
three different actions hold, pass to teammate-1, pass to teammate-2 (see Figure 3). More
generally, larger instantiations of keepaway would also be possible, e.g. 4vs3, 5vs4 or
more, resulting in even larger state and action spaces.

5. Experiments 

In this section we are finally ready to apply our proposed approach to the keepaway problem. 
We implemented and compared two different variations of the basic algorithm in a policy
iteration based framework: (a) Optimistic policy iteration using LSPE(λ) and (b) Actor-
critic policy iteration using LSTD(λ). As baseline method we used Sarsa(λ) with tilecoding,
which we re-implemented from (Stone et al., 2005) as faithfully as possible. Initially, we
also tried to employ BRM instead of LSTD in the actor-critic framework. However, this
set-up did not fare well in our experiments because of the stochastic state-transitions in
keepaway (resulting in highly variable outcomes) and BRM's inability to deal with this
situation adequately. Thus, the results for BRM are not reported here.

Optimistic policy iteration. Sarsa(λ) and LSPE(λ) paired with optimistic policy iteration
are on-policy learning methods, meaning that the learning procedure estimates the
Q-values from and for the current policy being executed by the agent. At the same time, the
agent continually updates the policy according to the changing estimates of the Q-function.
Thus policy evaluation and improvement are tightly interwoven. Optimistic policy iteration
(OPI) is an online method that immediately processes the observed transitions as they become
available from the agent interacting with the environment (Bertsekas and Tsitsiklis,
1996; Sutton and Barto, 1998).



Actor-critic. In contrast, LSTD(λ) paired with actor-critic is an off-policy learning method
adhering with more rigor to the policy iteration framework. Here the learning procedure 
estimates the Q-values for a fixed policy, i.e. a policy that is not continually modified to 
reflect the changing estimates of Q. Instead, one collects a large number of state transitions 
under the same policy and estimates Q from these training examples. In OPI, where the 
most recent version of the Q-function is used to derive the next control action, only one 
network is required to represent Q and make the predictions. In contrast, the actor-critic 
framework maintains two instantiations of regularization networks: one (the actor) is used 
to represent the Q-function learned during the previous policy evaluation step and which 
is now used to represent the current policy, i.e. control actions are derived using its predic- 
tions. The second network (the critic) is used to represent the current Q-function and is 
updated regularly. 

One advantage of the actor-critic approach is that we can reuse the same set of observed
transitions to evaluate different policies, as proposed in (Lagoudakis and Parr, 2003). We
maintain an ever-growing list of all transitions observed from the learning agent (irrespective
of the policy), and use it to evaluate the current policy with LSTD(λ). To reflect the real-
time nature of learning in RoboCup, where we can only carry out a very small amount 
of computations during one single function call to the agent, we evaluate the transitions 
in small batches (20 examples per step). Once we have completed evaluating all training 
examples in the list, the critic network is copied to the actor network and we can proceed 
to the next iteration, starting anew to process the examples, using this time a new policy. 
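
The batched evaluation schedule described above can be sketched as follows. This is an illustrative skeleton only: the class name `ActorCriticScheduler`, the `update` callback and the dictionary standing in for a regularization network are assumptions, and whether the critic is re-initialized at the start of each sweep is not stated in the text, so the sketch leaves it untouched.

```python
import copy

class ActorCriticScheduler:
    """Processes stored transitions in small batches; swaps critic into actor after a sweep."""

    def __init__(self, critic, update, batch_size=20):
        self.critic = critic                    # network currently being trained
        self.actor = copy.deepcopy(critic)      # network used to choose actions
        self.update = update                    # hypothetical: update(critic, transition)
        self.batch_size = batch_size
        self.pos = 0                            # next transition to process in this sweep
        self.iteration = 0                      # completed policy-iteration sweeps

    def step(self, transitions):
        """Called once per agent decision: process at most batch_size transitions."""
        batch = transitions[self.pos:self.pos + self.batch_size]
        for tr in batch:
            self.update(self.critic, tr)
        self.pos += len(batch)
        if self.pos >= len(transitions):        # sweep finished: critic becomes the new actor
            self.actor = copy.deepcopy(self.critic)
            self.pos = 0
            self.iteration += 1

# Toy usage: the "network" just counts how many transitions it has processed.
def count_update(critic, transition):
    critic["n"] += 1

sched = ActorCriticScheduler({"n": 0}, count_update, batch_size=20)
stored = list(range(45))                        # 45 hypothetical stored transitions
sched.step(stored)
after_first = (sched.critic["n"], sched.actor["n"], sched.iteration)
sched.step(stored)
sched.step(stored)
after_third = (sched.critic["n"], sched.actor["n"], sched.iteration, sched.pos)
```

Bounding the work per call by `batch_size` is what keeps the per-decision computation small, at the price of the actor lagging behind the critic until a sweep completes.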

Policy improvement and ε-greedy action selection. To carry out policy improvement,
every time we need to determine a control action for an arbitrary state $s^*$, we choose
the action $a^*$ that achieves the maximum Q-value; that is, given weights $\mathbf{w}_k$ and a set of
basis functions $\{x_1, \ldots, x_m\}$, we choose

$$a^* = \operatorname*{argmax}_a Q(s^*, a; \mathbf{w}_k) = \operatorname*{argmax}_a \mathbf{k}_m(s^*, a)^T \mathbf{w}_k.$$

Sometimes however, instead of choosing the best (greedy) action, it is recommended to try
out an alternative (non-greedy) action to ensure sufficient exploration. Here we employ the
ε-greedy selection scheme; we choose a random action with a small probability ε (ε = 0.01),
otherwise we pick the greedy action with probability 1 − ε. Taking a random action usually
means to choose among all possible actions with equal probability.
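
A minimal sketch of ε-greedy selection over a vector of Q-values is given below. The function name `epsilon_greedy` and the example Q-values are illustrative assumptions; the Q-values themselves would come from $\mathbf{k}_m(s^*, a)^T \mathbf{w}_k$ evaluated for each action.

```python
import numpy as np

def epsilon_greedy(q_values, eps, rng):
    """Return a uniformly random action with probability eps, else the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q = np.array([0.1, 0.9, 0.3])            # hypothetical Q-values for hold, pass-1, pass-2
greedy_action = epsilon_greedy(q, 0.0, rng)                    # eps = 0: always greedy
explored = {epsilon_greedy(q, 1.0, rng) for _ in range(200)}   # eps = 1: always random
```

Note that the random branch may also return the greedy action, so the greedy action is chosen slightly more often than 1 − ε overall.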



Under the standard assumption for errors in Bayesian regression (e.g., see Rasmussen and Williams,
2006), namely that the observed target values differ from the true function values by an
additive noise term (i.i.d. Gaussian noise with zero mean and uniform variance), it is also
possible to obtain an expression for the 'predictive variance' which measures the uncertainty
associated with value predictions. The availability of such confidence intervals (which is possible
for the direct least-squares problems LSPE and also BRM) could be used, as suggested
in (Engel et al., 2005a), to guide the choice of actions during exploration and to increase the
overall performance. For the purpose of solving the keepaway problem however, our initial
experiments showed no measurable increase in performance when including this additional
feature.

Remaining parameters. Since the kernel is defined for state-action tuples, we employ
a product kernel $k([s, a], [s', a']) = k_S(s, s')\,k_A(a, a')$ as suggested by Engel et al. (2005a).
The action kernel $k_A(a, a')$ is taken to be the Kronecker delta, since the actions in keepaway
are discrete and disparate. As state kernel $k_S(s, s')$ we chose the Gaussian RBF $k_S(s, s') =
\exp(-h\,\|s - s'\|^2)$ with uniform length-scale $h^{-1} = 0.2$. The other parameters were set
to: regularization $\sigma^2 = 0.1$, discount factor for RL $\gamma = 0.99$, $\lambda = 0.5$, and LSPE step size
$\eta_t = 0.5$. The novelty parameter for basis selection was set to TOL1 = 0.1. For the usefulness
part we tried out different values to examine the effect supervised basis selection has; we
started with TOL2 = 0 corresponding to the unsupervised case and then began increasing
the tolerance, considering alternatively the settings TOL2 = 0.001 and TOL2 = 0.01. Since in
the case of LSTD we are not directly solving a least-squares problem, we use the associated
BRM formulation to obtain an expression for the error reduction in the supervised basis
selection. Due to the very long runtime of the simulations (simulating one hour in the
soccer server roughly takes one hour real time on a standard PC) we could not try out
many different parameter combinations. The parameters governing RL were set according
to our experiences with smaller problems and are in the range typically reported in the
literature. The parameter governing the choice of the kernel (i.e. the length-scale of the
Gaussian RBF) was chosen such that for the unsupervised case (TOL2 = 0) the number of
selected basis functions approaches the maximum number of basis functions the CPU used
for these experiments was able to process in real-time. This number was determined to
be ~ 1400 (on a standard 2 GHz PC).
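
The product kernel and the resulting Q-value prediction can be sketched as follows. The names `product_kernel` and `q_value`, the toy basis set and the weights are hypothetical; the value `h=5.0` assumes that the stated length-scale $h^{-1} = 0.2$ enters the exponent as $h = 1/0.2$, and any rescaling of the 13 state variables is left out.

```python
import numpy as np

def product_kernel(s, a, s2, a2, h=5.0):
    """k([s,a],[s',a']) = exp(-h ||s - s'||^2) * (1 if a == a' else 0)."""
    if a != a2:
        return 0.0                      # Kronecker delta on the discrete actions
    d = np.asarray(s, dtype=float) - np.asarray(s2, dtype=float)
    return float(np.exp(-h * np.dot(d, d)))

def q_value(s, a, basis, w, h=5.0):
    """Q(s, a; w) = k_m(s, a)^T w for a list of basis state-action pairs."""
    k = np.array([product_kernel(s, a, sb, ab, h) for sb, ab in basis])
    return float(k @ w)

# Toy usage with a two-dimensional state instead of the 13-dimensional one.
basis = [((0.0, 0.0), 0), ((0.0, 0.0), 1), ((1.0, 1.0), 1)]   # hypothetical BV
w = np.array([2.0, -1.0, 0.5])                                # hypothetical weights
q_hold = q_value((0.0, 0.0), 0, basis, w)     # only the first basis function is active
q_unused = q_value((0.0, 0.0), 2, basis, w)   # no basis function uses action 2
```

Because of the Kronecker delta, a basis function only contributes to the Q-values of its own action, so the model effectively keeps a separate set of basis functions per action.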

Results. We evaluate every algorithm/parameter configuration using 5 independent runs.
The learning curves for these runs are shown in Figure 4. The curves plot the average
time the keepers are able to keep the ball (corresponding to the performance) against the
simulated time the keepers were learning (roughly corresponding to the observed training
examples). Additionally, two horizontal lines indicate the scores for the two benchmark
policies random behavior and optimized hand-coded behavior used in (Stone et al., 2005).

The plots show that generally RL is able to learn policies that are at least as effective as 
the optimized hand-coded behavior. This is indeed quite an achievement, considering that 
the latter is the product of considerable manual effort. Comparing the three approaches 
Sarsa, LSPE and LSTD we find that the performance of LSPE is on par with Sarsa. The 
curves of LSTD tell a different story however; here we are outperforming Sarsa by 25% 
in terms of performance (in Sarsa the best performance is about 15 seconds, in LSTD the 
best performance is about 20 seconds). This gain is even more impressive when we consider 
the time scale at which this behavior is learned; just after a mere 2 hours we are already 
outperforming hand-coded control. Thus our approach needs far fewer state transitions 
to discover good behavior. The third observation shows the effectiveness of our proposed 
supervised basis function selection; here we show that our supervised approach performs as 
well as the unsupervised one, but requires significantly fewer basis functions to achieve that
level of performance (~ 700 basis functions at TOL2 = 0.01 against 1400 basis functions at
TOL2 = 0).

Figure 4: From left to right: Learning curves for our approach with LSTD (TOL2 = 0), LSTD
(TOL2 = 0.001), LSTD (TOL2 = 0.01), and LSPE. At the bottom we show the curves
for Sarsa with tilecoding corresponding to (Stone et al., 2005). We plot the average
time the keepers are able to control the ball (quality of learned behavior)
against the training time. After interacting for 15 hours the performance does not
increase any more and the agent has experienced roughly 35,000 state transitions.

Regarding the unexpectedly weak performance of LSPE in comparison with LSTD, we
conjecture that this strongly depends on the underlying architecture of policy iteration (i.e.
OPI vs. actor-critic) as well as the specific learning problem. In a number of related
experiments carried out with the octopus arm benchmark we made exactly the opposite
observation (not discussed here in more detail, see Jung and Polani, 2007).



6. Discussion and related work 

We have presented a kernel-based approach for least-squares based policy evaluation in RL
using regularization networks as underlying function approximator. The key point is an efficient
supervised basis selection mechanism, which is used to select a subset of relevant basis
functions directly from the data stream. The proposed method was particularly devised
with high-dimensional, stochastic control tasks for RL in mind; we demonstrate its effectiveness
using the RoboCup keepaway benchmark. Overall the results indicate that kernel-based
online learning in RL is very well possible and recommendable. Even the rather few simulation
runs we made clearly show that our approach is superior to conventional function
approximation in RL using grid-based tilecoding. What could be even more important is
that the kernel-based approach only requires the setting of some fairly general parameters 
that do not depend on the specific control problem one wants to solve. On the other hand, 
using tilecoding or a fixed basis function network in high dimensions requires considerable 
manual effort on part of the programmer to carefully devise problem-specific features and 
manually choose suitable basis functions. 

Engel et al. (2003, 2005a) initially advocated using kernel-based methods in RL and
proposed the related GPTD algorithm. Our method using regularization networks develops
this idea further. Both methods have in common the online selection of relevant basis
functions based on (Csató and Opper, 2001). As opposed to the unsupervised selection in
GPTD, we use a supervised criterion to further reduce the number of relevant basis functions
selected. A more fundamental difference is the policy evaluation method addressed by the
respective formulation; GPTD models the Bellman residuals and corresponds to the BRM
approach (see Section 2.1.2). Thus, in its original formulation GPTD can be only applied
to RL problems with deterministic state transitions. In contrast, we provide a unified and
concise formulation of LSTD and LSPE which can deal with stochastic state transitions as
well. Another difference is the type of benchmark problem used to showcase the respective
method; GPTD was demonstrated by learning to control a simulated octopus arm, which
was posed as an 88-dimensional control problem (Engel et al., 2005b). Controlling the
octopus arm is a deterministic control problem with known state transitions and was solved
there using model-based RL. In contrast, 3vs2 keepaway is only a 13-dimensional problem;
here however, we have to deal with stochastic and unknown state transitions and need to
use model-free RL.



4. From the ICML06 RL benchmarking page:
http://www.cs.mcgill.ca/dprecup/workshops/ICML06/octopus.html
Appendix A. A summary of the updates

Let $x_{t+1} = (s_{t+1}, a_{t+1})$ be the next state-action tuple and $r_{t+1}$ be the reward associated
with the transition from the previous state $s_t$ to $s_{t+1}$ under $a_t$. Define the abbreviations:

$$\mathbf{k}_t := \mathbf{k}_m(x_t), \qquad \mathbf{k}_{t+1} := \mathbf{k}_m(x_{t+1}), \qquad \mathbf{h}_{t+1} := \mathbf{k}_t - \gamma\mathbf{k}_{t+1}$$

$$k^*_t := k(x_t, x_{t+1}), \qquad k^*_{t+1} := k(x_{t+1}, x_{t+1}), \qquad h^*_{t+1} := k^*_t - \gamma k^*_{t+1}$$

and $\mathbf{a}_{t+1} := \mathbf{K}_{mm}^{-1}\mathbf{k}_{t+1}$.



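A.1 Unsupervised basis selection

We want to test if $x_{t+1}$ is well represented by the current basis functions in the dictionary
or if we need to add $x_{t+1}$ to the basis elements. Compute

$$\delta = k^*_{t+1} - \mathbf{k}_{t+1}^T\mathbf{a}_{t+1}. \tag{9}$$

If $\delta >$ TOL1, then add $x_{t+1}$ to the dictionary, execute the growing step (see below) and
update

$$\mathbf{K}_{m+1,m+1}^{-1} = \begin{bmatrix} \mathbf{K}_{mm}^{-1} & \mathbf{0} \\ \mathbf{0}^T & 0 \end{bmatrix} + \frac{1}{\delta} \begin{bmatrix} -\mathbf{a}_{t+1} \\ 1 \end{bmatrix} \begin{bmatrix} -\mathbf{a}_{t+1} \\ 1 \end{bmatrix}^T.$$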



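A.2 Recursive updates for BRM

• Normal step $\{t, m\} \mapsto \{t+1, m\}$:

1. $$\mathbf{P}_{t+1,m}^{-1} = \mathbf{P}_{tm}^{-1} - \frac{\mathbf{P}_{tm}^{-1}\mathbf{h}_{t+1}\mathbf{h}_{t+1}^T\mathbf{P}_{tm}^{-1}}{\Delta}$$
   with $\Delta = 1 + \mathbf{h}_{t+1}^T\mathbf{P}_{tm}^{-1}\mathbf{h}_{t+1}$.

2. $$\mathbf{w}_{t+1,m} = \mathbf{w}_{tm} + \frac{\varrho}{\Delta}\mathbf{P}_{tm}^{-1}\mathbf{h}_{t+1}$$
   with $\varrho = r_{t+1} - \mathbf{h}_{t+1}^T\mathbf{w}_{tm}$.

• Growing step $\{t+1, m\} \mapsto \{t+1, m+1\}$:

1. $$\mathbf{P}_{t+1,m+1}^{-1} = \begin{bmatrix} \mathbf{P}_{t+1,m}^{-1} & \mathbf{0} \\ \mathbf{0}^T & 0 \end{bmatrix} + \frac{1}{\Delta_b} \begin{bmatrix} -\mathbf{w}_b \\ 1 \end{bmatrix} \begin{bmatrix} -\mathbf{w}_b \\ 1 \end{bmatrix}^T$$
   where
   $$\mathbf{w}_b = \mathbf{a}_{t+1} + \frac{\delta_h}{\Delta}\mathbf{P}_{tm}^{-1}\mathbf{h}_{t+1}, \qquad \Delta_b = \frac{\delta_h^2}{\Delta} + \sigma^2\delta, \qquad \delta_h = h^*_{t+1} - \mathbf{h}_{t+1}^T\mathbf{a}_{t+1}.$$

2. $$\mathbf{w}_{t+1,m+1} = \begin{bmatrix} \mathbf{w}_{t+1,m} \\ 0 \end{bmatrix} + \kappa \begin{bmatrix} -\mathbf{w}_b \\ 1 \end{bmatrix}$$
   where $\kappa = \dfrac{\delta_h \varrho}{\Delta_b \Delta}$.

• Reduction of regularized cost when adding $x_{t+1}$ (supervised basis selection):

$$\xi_{t+1,m+1} = \xi_{t+1,m} - \kappa^2\Delta_b$$

For supervised basis selection we additionally check if $\kappa^2\Delta_b >$ TOL2.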

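A.3 Recursive updates for LSTD(λ)

• Normal step $\{t, m\} \mapsto \{t+1, m\}$:

1. $$\mathbf{z}_{t+1,m} = (\gamma\lambda)\mathbf{z}_{tm} + \mathbf{k}_t$$

2. $$\mathbf{P}_{t+1,m}^{-1} = \mathbf{P}_{tm}^{-1} - \frac{\mathbf{P}_{tm}^{-1}\mathbf{z}_{t+1,m}\mathbf{h}_{t+1}^T\mathbf{P}_{tm}^{-1}}{\Delta}$$
   with $\Delta = 1 + \mathbf{h}_{t+1}^T\mathbf{P}_{tm}^{-1}\mathbf{z}_{t+1,m}$.

3. $$\mathbf{w}_{t+1,m} = \mathbf{w}_{tm} + \frac{\varrho}{\Delta}\mathbf{P}_{tm}^{-1}\mathbf{z}_{t+1,m}$$
   with $\varrho = r_{t+1} - \mathbf{h}_{t+1}^T\mathbf{w}_{tm}$.

• Growing step $\{t+1, m\} \mapsto \{t+1, m+1\}$:

1. $$\mathbf{z}_{t+1,m+1} = \begin{bmatrix} \mathbf{z}_{t+1,m}^T & z^*_{t+1,m} \end{bmatrix}^T$$
   where $z^*_{t+1,m} = (\gamma\lambda)\mathbf{z}_{tm}^T\mathbf{a}_{t+1} + k^*_t$.

2. $$\mathbf{P}_{t+1,m+1}^{-1} = \begin{bmatrix} \mathbf{P}_{t+1,m}^{-1} & \mathbf{0} \\ \mathbf{0}^T & 0 \end{bmatrix} + \frac{1}{\Delta_b} \begin{bmatrix} -\mathbf{w}_b^{(1)} \\ 1 \end{bmatrix} \begin{bmatrix} -\mathbf{w}_b^{(2)} & 1 \end{bmatrix}$$
   where
   $$\mathbf{w}_b^{(1)} = \mathbf{a}_{t+1} + \frac{\delta^{(1)}}{\Delta}\mathbf{P}_{tm}^{-1}\mathbf{z}_{t+1,m}, \qquad \mathbf{w}_b^{(2)} = \mathbf{a}_{t+1}^T + \frac{\delta^{(2)}}{\Delta}\mathbf{h}_{t+1}^T\mathbf{P}_{tm}^{-1},$$
   with $\delta^{(1)} = h^*_{t+1} - \mathbf{h}_{t+1}^T\mathbf{a}_{t+1}$, $\delta^{(2)} = z^*_{t+1,m} - \mathbf{z}_{t+1,m}^T\mathbf{a}_{t+1}$,
   and $\Delta_b = \dfrac{\delta^{(1)}\delta^{(2)}}{\Delta} + \sigma^2(k^*_{t+1} - \mathbf{k}_{t+1}^T\mathbf{a}_{t+1})$.

3. $$\mathbf{w}_{t+1,m+1} = \begin{bmatrix} \mathbf{w}_{t+1,m} \\ 0 \end{bmatrix} + \kappa \begin{bmatrix} -\mathbf{w}_b^{(1)} \\ 1 \end{bmatrix}$$
   where $\kappa = \dfrac{\delta^{(2)}\varrho}{\Delta_b\Delta}$.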






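A.4 Recursive updates for LSPE(λ)

• Normal step $\{t, m\} \mapsto \{t+1, m\}$:

1. $$\mathbf{z}_{t+1,m} = (\gamma\lambda)\mathbf{z}_{tm} + \mathbf{k}_t$$
   $$\mathbf{A}_{t+1,m} = \mathbf{A}_{tm} + \mathbf{z}_{t+1,m}\mathbf{h}_{t+1}^T$$
   $$\mathbf{b}_{t+1,m} = \mathbf{b}_{tm} + \mathbf{z}_{t+1,m}r_{t+1}$$

2. $$\mathbf{P}_{t+1,m}^{-1} = \mathbf{P}_{tm}^{-1} - \frac{\mathbf{P}_{tm}^{-1}\mathbf{k}_{t+1}\mathbf{k}_{t+1}^T\mathbf{P}_{tm}^{-1}}{\Delta}$$
   with $\Delta = 1 + \mathbf{k}_{t+1}^T\mathbf{P}_{tm}^{-1}\mathbf{k}_{t+1}$.

3. $$\mathbf{w}_{t+1,m} = \mathbf{w}_{tm} + \eta\,\mathbf{P}_{t+1,m}^{-1}(\mathbf{b}_{t+1,m} - \mathbf{A}_{t+1,m}\mathbf{w}_{tm})$$

• Growing step $\{t+1, m\} \mapsto \{t+1, m+1\}$:

1. $$\mathbf{z}_{t+1,m+1} = \begin{bmatrix} \mathbf{z}_{t+1,m} \\ z^*_{t+1,m} \end{bmatrix}, \qquad \mathbf{b}_{t+1,m+1} = \begin{bmatrix} \mathbf{b}_{t+1,m} \\ \mathbf{a}_{t+1}^T\mathbf{b}_{tm} + z^*_{t+1,m}r_{t+1} \end{bmatrix},$$
   $$\mathbf{A}_{t+1,m+1} = \begin{bmatrix} \mathbf{A}_{t+1,m} & \mathbf{A}_{tm}\mathbf{a}_{t+1} + \mathbf{z}_{t+1,m}h^*_{t+1} \\ \mathbf{a}_{t+1}^T\mathbf{A}_{tm} + z^*_{t+1,m}\mathbf{h}_{t+1}^T & \mathbf{a}_{t+1}^T\mathbf{A}_{tm}\mathbf{a}_{t+1} + z^*_{t+1,m}h^*_{t+1} \end{bmatrix}$$
   where $z^*_{t+1,m} = (\gamma\lambda)\mathbf{z}_{tm}^T\mathbf{a}_{t+1} + k^*_t$.

2. $$\mathbf{P}_{t+1,m+1}^{-1} = \begin{bmatrix} \mathbf{P}_{t+1,m}^{-1} & \mathbf{0} \\ \mathbf{0}^T & 0 \end{bmatrix} + \frac{1}{\Delta_b} \begin{bmatrix} -\mathbf{w}_b \\ 1 \end{bmatrix} \begin{bmatrix} -\mathbf{w}_b \\ 1 \end{bmatrix}^T$$
   where
   $$\mathbf{w}_b = \mathbf{a}_{t+1} + \frac{\delta}{\Delta}\mathbf{P}_{tm}^{-1}\mathbf{k}_{t+1}, \qquad \Delta_b = \frac{\delta^2}{\Delta} + \sigma^2\delta, \qquad \delta = k^*_{t+1} - \mathbf{k}_{t+1}^T\mathbf{a}_{t+1}.$$

3. $$\mathbf{w}_{t+1,m+1} = \begin{bmatrix} \mathbf{w}_{t+1,m} \\ 0 \end{bmatrix} + \kappa \begin{bmatrix} -\mathbf{w}_b \\ 1 \end{bmatrix}$$
   where $\kappa = \eta\,(c - \mathbf{w}_b^T\mathbf{d})/\Delta_b$, with $c$ and $\mathbf{d}$ as defined in the next item.

• Reduction of regularized cost when adding $x_{t+1}$ (supervised basis selection):

$$\xi_{t+1,m+1} = \xi_{t+1,m} - \Delta_b^{-1}(c - \mathbf{w}_b^T\mathbf{d})^2$$

where $c = \mathbf{a}_{t+1}^T(\mathbf{b}_{tm} - \mathbf{A}_{tm}\mathbf{w}_{tm}) + z^*_{t+1,m}(r_{t+1} - \mathbf{h}_{t+1}^T\mathbf{w}_{tm})$ and $\mathbf{d} = \mathbf{b}_{t+1,m} - \mathbf{A}_{t+1,m}\mathbf{w}_{tm}$.
For supervised basis selection we additionally check if $\Delta_b^{-1}(c - \mathbf{w}_b^T\mathbf{d})^2 >$ TOL2.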